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Abstract 

Given a sample of size n from a population of individuals belonging to different species with unknown 
proportions, a popular problem of practical interest consists in making inference on the probability 
Dn{l) that the (n + l)-th draw coincides with a species with frequency I in the sample, for any 
I = 0,1,...,n. This paper contributes to the methodology of Bayesian nonparametric inference 
for Dn{l). Specifically, under the general framework of Gibbs-type priors we show how to derive 
credible intervals for a Bayesian nonparametric estimation of £)„(/), and we investigate the large n 
asymptotic behaviour of such an estimator. Of particular interest are special cases of our results 
obtained under the specification of the two parameter Poisson-Dirichlet prior and the normalized 
generalized Gamma prior, which are two of the most commonly used Gibbs-type priors. With respect 
to these two prior specifications, the proposed results are illustrated through a simulation study and 
a benchmark Expressed Sequence Tags dataset. To the best our knowledge, this illustration provides 
the first comparative study between the two parameter Poisson-Dirichlet prior and the normalized 
generalized Gamma prior in the context of Bayesian nonparemetric inference for Dn{l). 

Keywords: Asymptotics; Bayesian nonparametrics; credible intervals; discovery probability; Gibbs- 
type priors; Good-Turing estimator; normalized generalized Gamma prior; smoothing technique; two 
parameter Poisson-Dirichlet. 


1 Introduction 


The problem of estimating discovery probabilities arises when an experimenter is sampling 
from a population of individuals (Xj)j>i belonging to an (ideally) infinite number of species 
(Yi)i>i with unknown proportions {qi)i>i. Given an observable sample = {Xi,... ,Xn), 
interest lies in estimating the probability that the (n + l)-th draw coincides with a species 
with frequency I in Xn, for any I = 0,1,... ,n. This probability is denoted by Dn{l) and 
referred to as the /-discovery, while discovery probabilities is used to address this class of 
probabilities. In terms of the species proportions qi’s, we can write 

Dn{l) = (1) 

i>l 

where Ni^n denotes the frequency of the species 1) in the sample. Here Dn{0) is the propor¬ 
tion of yet unobserved species or, equivalently, the probability of discovering a new species. 
The reader is referred to Bunge and Fitzpatrick (1993) and Bunge et al. (2014) for compre¬ 
hensive reviews on the full range of statistical approaches, parametric and nonparametric, as 
well as frequentist and Bayesian, for estimating the /-discovery and related quantities. The 
term discovery probability is also used in the literature to refer to a more general class of 
probabilities that originate when considering an additional unobserved sample of size m > 0. 
For instance, in this framework and conditionally on Xn, Lijoi et al. (2007) consider the 
problem of estimating the probability that Xn+m+i is new, while Favaro et al. (2012) focus 
on the so-called m-step /-discovery, the probability that Xn+m+i coincides with a species that 
has been observed with frequency / in the enlarged sample of size n + m. According to this 
terminology, the discovery probability Dn{l) introduced in (1) is the 0-step /-discovery. 

The estimation of the /-discovery has found numerous applications in ecology and lin¬ 
guistics, and its importance has grown considerably in recent years, driven by challenging 
applications in bioinformatics, genetics, machine learning, design of experiments, etc. For 
examples, Efron and Thisted (1976) and Church and Gale (1991) discuss applications in em¬ 
pirical linguistics; Good (1953) and Chao and Lee (1992), among many others, discuss the 
probability of discovering new species of animals in a population; Mao and Lindsay (2002), 
Navarrete et al. (2008), Lijoi et al. (2007a), and Guindani et al. (2014) study applications in 
genomics and molecular biology; Zhang (2005) considers applications to network species sam¬ 
pling problems and data confidentiality; Caron and Fox (2015) discuss applications arising 
from bipartite and sparse random graphs; Rasmussen and Starr (1979) and Chao et al. (2009) 
investigate optimal stopping procedures in finding new species; Bubeck et al. (2013) study 
applications within the framework of multi-armed bandits for security analysis of electric 
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power systems. 

This paper contributes to the methodology of Bayesian nonparametric inference for Dn{l)- 
As observed in Lijoi et al. (2007) for the discovery probability of new species (0-discovery 
Dn{0)), a natural Bayesian nonparametric approach for estimating Dn{l) consists in random¬ 
izing the qiS. Specifically, consider the random probability measure Q = where 

(9i)i>i are nonnegative random weights such that 9* = 1 almost surely, and (Ti)i>i 

are random locations independent of {qi)i>i and independent and identically distributed as 
a nonatomic probability measure vq on a space X. Then, it is assumed that 


Xi\Q ~ g, i = l,...,n 

g ~ 


( 2 ) 


for any n > 1, where ^ is the prior distribution over the species composition. Under the 
Bayesian nonparametric model (2), the estimator of Dn{l) with respect to a squared loss 
function, say arises from the predictive distributions characterizing (Aj)j>i. Specify¬ 

ing g in the large class of Gibbs-type random probability measures by Pitman (2003), we 
consider the problem of deriving credible intervals for T)n(l), and study the large n asymptotic 
behaviour of T)n{l)- Before introducing our results, we review some aspects of Vni})- 


1.1 Preliminaries on !?„(/) 

Let Xn be a sample from a Gibbs-type random probability measure g, featuring Kn = kn 
species ... ,X'^^, the unique values of Xn recorded in order of appearance, with corre¬ 
sponding frequencies (A^i,n, • • ■, XK„,n) = (?^i,n, • • •, ?^fc„,r^)• Here for every i = 1,2,..., 
there exists a non-negative integer such that X* = Ig- and Ni^n = X^i,n, where (Un)n>i is 
the sequence of random atoms in the definition of Q. Let a G (0, 1 ) and {Vn,k)k<n,n>i be a tri¬ 
angular array of nonnegative weights such that Viq = 1 and V^^k = {n — crk)Vn+i^k + Vn+i,k+i- 
According to de Finetti’s representation theorem, X^ is part of an exchangeable sequence 
{Xi)i>i whose distribution has been characterized in Pitman (2003) and Gnedin and Pitman 
(2006) as follows: for any set A in the Borel sigma-algebra of X, 


P[A„+i G A I 




Ur 


n,kn 


t'o(A) 


Ui.+ l,fcn. 

^n,k„ 


kn 

^ ^ {P‘i,n 

i=l 


a)5x*{A). 


(3) 


The conditional probability (3) is referred to as the predictive distribution of Q. Two peculiar 
features of Q emerge directly from (3): the probability that Xn+i ^ {X ^,..., depends 

only on kn', the probability that Xn+i = X* depends only on {kn,ni^n)- See De Blasi et al. 
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(2015) for a review on Gibbs-type priors in Bayesian nonparametrics. 

Two of the most commonly used nonparametric priors are of Gibbs-type; the two-parameter 
Poisson-Dirichlet (PD) prior in Pitman (1995) and Pitman and Yor (1997); the normalized 
generalized Gamma (GG) prior in Pitman (2003) and Lijoi et al. (2007b) (see also Priinster 
(2002),James (2002),Lijoi and Priinster (2003), and Regazzini et al. (2003) for early appear¬ 
ance of normalized GG). The Dirichlet process of Ferguson (1973) can be recovered from both 
priors by letting a —)• 0. For any ct G (0,1), 0 > —a and r > 0, the predictive distributions 
of the two-parameter PD and the normalized GG priors are of the form (3) where Vn,k„, 
respectively, are 


nto\o+^<^) 

{0)n 


and 


(jkn IgT 

r(n) 


n—1 

E 

i=0 


n — 1 


(-r)T k 



(4) 


where {a)n ■= no<i<n-i(® + *) with (a)o := 1, and r(a, 6) := exp{—x}dx. See 

Pitman (1995); Lijoi et al. (2007b) for details on (4). According to (3), the parameter a 
admits an interpretation in terms of the distribution of the larger a, the higher is the 
number of species and, among these, most of them have small abundances. In other terms, 
the larger the a the flatter is the distribution of Kn- The parameters 9 and r are location 
parameters, the bigger they are the larger the expected number of species tends to be. 

Denote by M; „ the number of species with frequency I in and by mi^n the correspond¬ 
ing observed value. An estimator Vn{l) arises from (3) by suitably specifying the Borel set 
A. In particular, if Aq := X \ {Yj^,..., and Ai := {X* : Ni^n = 0) for any / = 1,..., n, 

then one has 


D„(0) = F[Xn+i G Ao I Xn] = E[g(Ao) | ( 5 ) 

^n,kn 

Vn{l) = P[Y„+1 G Al I Xn] = E[Q{Ai) I = {l- a)mi^n^^- (6) 

^n,kn 

Estimators (5) and (6) provide Bayesian counterparts to the celebrated Good-Turing estima¬ 
tor Vn{l) = {I + for any / = 0,1,..., n — 1, which is a frequentist nonparametric 

estimator of Dn{l) introduced in Good (1953). The most notable difference between Vn{l) 
and T>n{l) consists in the use of the information in Dn(0 is a function of and 

not of {kn,mi^n) as one would intuitively expect for an estimator of Dn{l)- See Favaro et al. 
(2012) for details. 

Under the two-parameter PD prior, Favaro et al. (2016) established a large n asymptotic 
relationship between Vn{l) and Vn{l)- Due to the irregular behaviour of the m^^m’s, the 
peculiar dependency on makes T)n{l) a sensible estimator only if I is sufficiently small 
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with respect to n. See for instance Good (1953) and Sampson (2001) for examples of absnrd 
estimates determined by Vn{l)- In order to overcome this drawback, Good (1953) suggested 
smoothing {mi^n)i>i to a more regular series (m-;^)/>i, where ^ = pikn with y’ = 
being nonnegative weights such that ~ resulting smoothed 

estimator is ^ 

Vn{l-,^) = (/ + 1 )^^^. 

n 

See Chapter 7 in Sampson (2001) and references therein for a comprehensive account on 
smoothing techniques for 'Dn{l)- According to Theorem 1 in Favaro et al. (2016), as n becomes 
large, T)n{l) is asymptotically equivalent to y'vT)), where denotes a smoothing rule 
such that 




cj(l 


- o-)z-i 


(7) 


While the smoothing approach was introduced as an ad hoc tool for post processing the 
irregular mien’s in order to improve the performance of Pn(0) Theorem 1 in Favaro et al. 
(2016) shows that, for a large sample size n, a similar smoothing mechanism underlies the 
Bayesian nonparametric framework (2) with a two-parameter PD prior. Interestingly, the 
smoothing rule has been proved to be a generalization of the Poisson smoothing rule 

discussed in Good (1953) and Engen (1978). 


1.2 Contributions of the paper and outline 

The problem of associating a measure of uncertainty to Bayesian nonparametric estimators 
for discovery probabilities was first addressed in Lijoi et al. (2007) where estimates of the 
probability of observing a new species are endowed with highest posterior density intervals. 
Favaro et al. (2016) derive asymptotic posterior credible intervals covering also the case of 
species already observed with a given frequency. These contributions ultimately rely on the 
presence of an additional unobserved sample. While the approach of Lijoi et al. (2007) cannot 
be used to associate a measure of uncertainty to Dri(O), where such additional sample is not 
considered, the approach of Favaro et al. (2016) could be taken to derive approximate credible 
intervals for Vnil), I = 0,1,... ,n. Nonetheless, due to the asymptotic nature of the approach, 
the resulting credible intervals are likely to perform poorly for moderate sample size n by 
underestimating the uncertainty associated to the estimators. They then leave essentially 
unaddressed the issue of quantifying the uncertainty associated to the estimators T>n{l)-, for 
/ = 0,1,..., n. In this paper we provide an answer to this problem. With a slight abuse of 
notation, throughout the paper we write X \ Y to denote a random variable whose distribution 
coincides with the conditional distribution of X given Y. Since Vn{l) = IE[(5(A/) | Xn], the 
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problem of deriving credible intervals for 'Dn{l) boils down to the problem of characterizing 
the distribution of Q{Ai) \ Xn, for any I = 0,1,..., n. Indeed this distribution takes on the 
interpretation of the posterior distribution of Dn{l) with respect to the sample X^. For any 
Gibbs-type priors we provide an explicit expression for £n,r{l) '■= | for any r > 

1. Due to the bounded support of Q{Ai) \ Xn, the sequence {£n,r{l))r>i characterizes uniquely 
the distribution of Q{Ai) \ Xn and, in principle, it can be used to obtain an approximate 
evaluation of such a distribution. In particular, under the two-parameter PD prior and the 
normalized GG prior we present an explicit and simple characterization of the distribution 
of Q{Ai) I Xn. 

We also study the large n asymptotic behaviour of T>n{l), thus extending Theorem 1 in 
Favaro et al. (2016) to Gibbs-type priors. Specifically, we show that, as n tends to infinity, 
'Dn(O) and Pn(0 asymptotically equivalent to = akn/n and 

respectively. In other terms, at the order of asymptotic equivalence, any Gibbs-type prior 
leads to the same approximating estimator 'D'n{l). As a corollary we obtain that T)n{l) is 
asymptotically equivalent to the smoothed Good-Turing estimator Vn{l', namely c^pd 

is invariant with respect to any Gibbs-type prior. Refinements of 'D'nil) are presented for 
the two-parameter PD prior and the normalized GG prior. A thorough study of the large 
n asymptotic behaviour of (3) reveals that for Vn^kn in (4) the estimator Vnil) admits large 
n asymptotic expansions whose first order truncations coincide with T>'n{l), and that second 
order truncations depend on 0 > — cr and r > 0, respectively, thus providing approximating 
estimators that differ. A discussion of these second order asymptotic refinements is presented 
with a view towards the problem of finding corresponding refinements of the relationship 
between Vn{^ and T>n{k, ^pb)- 

The estimators Vn{^ depend on the values assigned to the involved parameters (see e.g. 
the sensitivity analysis in (Favaro et ah, 2016) for the two-parameter PD case) that therefore 
must be suitably estimated, e.g. via an empirical Bayes approach. Taking into account 
the method used to estimate the parameters characterizing the underlying Gibbs-type prior 
would then make the analysis of the asymptotic behaviour of T>n{l) more thorough, but we 
consider the parameters as fixed. We want to stick to the original Bayesian nonparametric 
framework for the estimation of discovery probabilities, as set forth in Lijoi et al. (2007), and 
we believe that this best serves the purpose of comparing the asymptotic behaviour of the 
two classes of estimators, highlighting the effect of the parameters in both. 

Our results are illustrated in a simulation study and in the analysis of a benchmark dataset 
of Expressed Sequence Tags (ESTs), which are short cDNA sub-sequences highly relevant for 
gene identification in organisms (see Lijoi et ah, 2007a). To the best of our knowledge, only the 
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two-parameter PD prior has been so far applied in the context of Bayesian nonparametric 
inference for the discovery probability. We consider the two-parameter PD prior and the 
normalized GG prior. It turns out that the two-parameter PD prior leads to estimates of the 
/-discovery, as well as associated credible intervals, that are close to those obtained under the 
normalized GG prior specification. This surfaces due to a representation of the two-parameter 
PD prior in terms of a suitable mixture of normalized GG priors. Credible intervals for T>n{l) 
are also compared with corresponding confidence intervals for the Good-Turing estimator, 
which as obtained by Mao (2004) and Baayen (2001). A second numerical illustration is 
devoted to the large n asymptotic behaviour of T)n{l)-, by using simulated data we compare 
the exact estimator T>n{l) with its first order and second order approximations. 

In Section 2 we present some distributional results for Q{Ai) \ Xn] these results provide 
a fundamental tool for deriving credible intervals for the Bayesian nonparametric estimator 
'Dn(/)- In Section 3 we investigate the large n asymptotic behaviour of Vnil), and we discuss 
its relationship with smoothed Good-Turing estimators. Section 4 contains some numerical 
illustrations. Proofs, technical derivations and additional illustrations are available in the 
Appendix. 

2 Credible intervals for Vn{l) 

An integral representation for the characterizing the predictive distributions (3) was 

introduced by Pitman (2003), and leads to a useful parameterization for Gibbs-type priors. 
See also Gnedin and Pitman (2006) for details. For any a £ (0,1) let fa be the density 
function of a positive cr-stable random variable, /q^°° ex.p{—tx} fa{x)dx = exp{—for any 
t > 0. Then, for some nonnegative function h, one has 

fc„ r+oo i-l 

Vn,k„ = Vh,{n,k„) ■= faiil - p)t)dpdt. ( 1 ) 

According to (3) and (1), a Gibbs-type prior is parameterized by (cr, h, vq); we denote by Qh 
this Gibbs-type random probability measure. The expression (4) for the two-parameter PD 
prior is recovered from (1) by setting h{t) =p{t-,a,d) := ar{0)t~^/T{9/a), for any a £ (0,1) 
and 0 > —a. The expression (4) for the normalized GG prior is recovered from (1) by setting 
h{t) = g{t; cr, r) := expjr'^ — rt}, for any r > 0. See Section 5.4 in Pitman (2003) for details. 

Besides providing a parameterization for Gibbs-type priors, the representation (1) leads 
to a simple numerical evaluation of V/j Specifically, let Ba^b be a Beta random variable 

with parameter (a, h) and, for any a £ (0,1) and c > —1, let Sa,c be a positive random variable 
with density function fsa,ci^) — r(c(T -|- l)x~'^^ fa{x)/r(c 1). Sa,c is typically referred to as 
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the polynomially tilted cr-stable random variable. Simple algebraic manipulations of (1) lead 
to 




a 


kn — 1 


h,{n,k„) 


r{kr, 


r(n) 




5, 


cr,kn 


B. 


akn,n—crkr] 


( 2 ) 


with B„k„,n-akn independent of S^^kn- According to (2) a Monte Carlo evaluation of Vh^(n,kn) 
can be performed by sampling from B^i.^^n-ak„ and Sfj^k„- la this respect, an efficient rejec¬ 
tion sampling for S'o-,c has been proposed by Devroye (2009). The next theorem, combined 
with (2), provides a practical tool for obtaining an approximate evaluation of the credible 
intervals for Pri(0- 


Theorem 1. Let be a sample generated from Qh according to (2) and featuring = 
kn species, labelled by ,..., with corresponding frequencies {Ni^n, ■ ■ ■, XK„,n) = 

(ni,„,..., nk„,n)- For any set A in the Borel sigma-algebra of X, let Hn,kn{A) = 
a)5x*{A). Then, for any r > 1, the rth moment E[((5/i(A))'’ | X^^ coincides with 

^ ’'nV,t.(.4)+j,(i-a)+,), (3) 

i=0 0<jri<--<ii<i q=0 


Let Mn '■= ■ ■ ■ ,Mn,n) = ■ ■ ■ , 1 ^^ 71 , 71 ) be the frequency counts from a sample 

Xn from Qh- In order to obtain credible intervals for 25^(0 take two specihcations of the 
Borel set A: Aq = X \ {X^, ..., X^^} and A/ = {X* : Ni^n = 1}, for any I = 1,..., n. With 
them, (3) reduces to 

SnAO) = niQh{Ao)y I X„] = V {-iyYyA±^(n - akAu (4) 

fAo Vv Vh,(77,kn) 

SnAl) = niQh{Ai)Y I _ a)miAr, (5) 

respectively. Equations (4) and (5) take on the interpretation of the r-th moments of the 
posterior distribution of Dn{0) and Dn{l) under the specification of a Gibbs-type prior. In 
particular for r = 1, by using the recursion Vh^(n,k„) = (n - <ykn)Vh^(^n+i,kn) + Fh,(n+i,fe„-ri), 
(4) and (5) reduce to the Bayesian nonparametric estimators of Dn{l) displayed resp. in (5) 
and (6). 

The distribution of Qh{Ai) \ Xn is on [0,1] and, therefore, it is characterized by {£n,r{l))r>i- 
The approximation of a distribution given its moments is a longstanding problem which 
has been tackled by such approaches as expansions in polynomial bases, maximum entropy 
methods, and mixtures of distributions. For instance, the polynomial approach consists in 









approximating the density function of Qhi^i) \ with a linear combination of orthogonal 
polynomials, where the coefficients of the combination are determined by equating £n,r{l) 
with the moments of the approximating density. The higher the degree of the polynomials, 
or equivalently the number of moments used, the more accurate the approximation. As a 
rule of thumb, ten moments turn out to be enough in most cases. See Provost (2005) for 
details. The approximating density function of Qfi{Ai) \ Xn can then be used to obtain an 
approximate evaluation of the credible intervals for T)n{l). This is typically done by generating 
random variates, via rejection sampling, from the approximating distribution of Qh{Ai) \ Xn- 
See Arbel et al. (2016) for details. 

Under the specification of the two-parameter PD prior and the normalized GG prior, 
(4) and (5) lead to explicit and simple characterizations for the distributions of Qp{Ai) \ Xn 
and Qg{Ai) \ Xn, respectively. Let Ga,i be a Gamma random variable with parameter (a, 1) 
and, for any a G (0,1) and 6 > 0, let be a random variable with density function 
~ Gxjp{b^ — bx}fa{x). is typically referred to as the exponentially tilted u-stable 
random variable. Finally, define Wa,b = bRfj^b/{bRa,b + Ga,i), where Ga,i is independent of 
Ra,b- The random variable Wa,b is nonnegative and with values on the set [0,1]. 

Proposition 1. Let Xn be a sample generated from Qp according to (2) and featuring 
Rn = kn species with Mn = ■ ■ ■, ^n,n)- Let Zp be a nonnegative random variable with 

density function of the form 


fzJx) = 


a 


T{e/a + k„ 


-X 


O+crkn — l—x'^ 


ii(0,+oo) {x)- 


Then, Qp(Ao) | Xn — ^Vn—akn,Zp — Rd+akn,n—crkn and Qp(A;) | Xn — 

k^{l—cr)mi„,n—akri — (l—ij)mi„(.^ ^^n—akn,Zp) R{l—cT)mi„,9-\-n—(l—a)mi„‘ 

Proposition 2. Let Xn be a sample generated from Qg according to (2) and featuring 
Kn = kn species with Mn = ("kni^n, ■ ■ ■, ^n,n)- Let Zg be a nonnegative random variable with 
density function of the form 


ax^kn 1 exp{—X°'}l(.r,+oo)(3;) 


Then, Qgi^Ao^ | Xn ^^n—(7k„,Zg ^^^Qg{Ai^ | Xn B(^i_fj'jmi „,n—crkn — (l—a)mi „{^ ^^n—akn,Zg')- 


According to Propositions 1 and 2, the random variables Qp{Aq) \ Xn and Qg(Ao) | Xn 
have a common structure driven by the W random variable. Moreover, for any I = 1,..., n, 
Qp{Ai) I Xn and Qg{Ai) \ Xn are obtained by taking the same random proportion B^_n)nii „,n-akn- 
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of (1 — Wn-akn,Zp) and (1 — Wn-o-fc„,Zg), respectively. Under the specification of the two- 
parameter PD prior and the normalized GG prior, Propositions 1 and 2 provide practical 
tools for deriving credible intervals for the Bayesian nonparametric estimator T)n{l), for any 
I = 0,1,..., n. This is typically done by performing a numerical evaluation of appropriate 
quantiles of the distribution of Qp{Ai) \ Xn and Qg{Ai) \ Xn- In the special case of the Beta 
distribution, quantiles can be also determined explicitly as solutions of a certain class of 
non-linear ordinary differential equations. See Steinbrecher and Shaw (2008) and references 
therein for a detailed account on this approach. 

To obtain credible intervals for T>n{l), we generate random variates from Qp(Ai)\Xn 
and Qg{Ai)\Xn- With the two-parameter PD prior, sampling from Qp{Ai)\Xn for any 
I = 0,1,... ,n is straightforward, requiring generation of random variates from a Beta dis¬ 
tribution. With the normalized GG prior, sampling from Qp{Ai) \ Xn for any I = 0,1,..., n 
is also straightforward. As the density function of the transformed random variable Zg is 
log-concave, one can sample from Zg by means of the adaptive rejection sampling of Gilks 
and Wild (1992). Given Zg^ the problem of sampling from Wn-(jk„,Zg boils down to the prob¬ 
lem of generating random variates from the distribution of the exponentially tilted u-stable 
random variable Ra,Zg- This can be done by resorting to the efficient rejection sampling 
proposed by Devroye (2009). 

3 Large sample asymptotics for Vn{l) 

We investigate the large n asymptotic behavior of the estimator T’n(O) with a view towards 
its asymptotic relationships with smoothed Good-Turing estimators. Under a Gibbs-type 
prior, the most notable difference between the Good-Turing estimator 'Dn{^) and T)n{l) can 
be traced to the different use of the information contained in the sample Xn- Thus Dn(0) is 
a function of while Dn(0) is a function of kn, and T>n{l) is a function of while 

T)n{l) is a function of for any / = 1,..., n. Let an — bn mean that lim„^+oo anjbn = 1. 
We show that, as n tends to infinity, Dn(0 — where =5 ^pd is the smoothing rule 

displayed in (7). Such a result thus generalizes Theorem 1 in Favaro et al. (2016) to the 
entire class of Gibbs-type priors. The asymptotic results of this section hold almost surely, 
but the probabilistic formalization of this idea is postponed to the proofs in the Appendix. 

Theorem 2. For almost every sample Xn generated from Qh according to (2) and 
featuring Kn = kn species with Mn = {mi^n, ■ ■ ■, mn,n), we have 



( 1 ) 
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( 2 ) 


Vn{l) 


(l-cr) 


mi'. 


n 


+ o 


mi^r. 

n 


By a direct application of Proposition 13 in Pitman (2003) and Corollary 21 in Gnedin et 
al. (2007) we can write that, for almost every sample Xn from Qp, featuring Kn = kn species 
with Mn = • • • , mn,n), 


mi^n 


Cj(l - a)i-i 


kn 


(3) 


as n —>■ + 00 . By suitably combining (1) and (2) with (3), we obtain 




mi+i. 

n 


cs (/ + 1) 


{i+iy. 


n 


(4) 


for any / = 0,1,..., n. See the Appendix for details on (4). The first equivalence in (4) shows 
that, as n tends to inhnity, Vn{l) is asymptotically equal to the Good-Turing estimator Vnil), 
whereas the second equivalence shows that, as n tends to infinity, ^pd is a smoothing rule 
for the frequency counts mi^n in T>„(/). We refer to Section 2 in Favaro et al. (2016) for a 
relationship between the smoothing rule and the Poisson smoothing in Good (1953). 

A peculiar feature of is that it does not depend on the function h characterizing the 
Gibbs-type prior. Thus, for instance, ^pd is a smoothing rule for both the two-parameter 
PD prior and the normalized GG prior. This invariance property of J^pd is clearly deter¬ 
mined by the fact that the asymptotic equivalences in (4) arise by combining (3), which does 
not depend on h, with (1) and (2), which also do not depend of h. It is worth noticing 
that, unlike the smoothing rule S^pb, the corresponding smoothed estimator 'D{1 ; S^pb) does 
depend on h through kn- Indeed, according to model (2), Q is the data generating process 
and therefore the choice of a specific Gibbs-type prior Q or, in other terms, the specification 
of h, affects the distribution of Kn- Intuitively, smoothing rules depending on the function h, 
if any exists, necessarily require to combine refinements of the asymptotic expansions (1) and 
(2) with corresponding refinements of the asymptotic equivalence (3). Under the specification 
of the two-parameter PD prior and the normalized GG prior, the next propositions provide 
asymptotic rehnements of Theorem 2. 


Proposition 3. For almost every sample Xn generated from Qp according to (2) and 
featuring Kn = kn species with Mn = (mi,n, • ■ ■, mn,n), we have 


= Ml) = {I - af'-'-" 


n n 


n 


n 


e , 

1--)+o 
n 


mi- 
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Proposition 4. For almost every sample generated from Qg according to (2) and 
featuring Kn = kn species with Mn = • • •, rnn,n), we have 


VniO) 


(7 kn 

n 


+ rk^ + o 



Vn{l) 



In Propositions 3 and 4, we introduce second order approximations of 'Dn(O) and Vn{l) by 
considering a two-term truncation of the corresponding asymptotic series expansions. Here 
it is sufficient to include the second term in order to introduce the dependency on 0 > —a 
and r > 0, respectively, and then the approximations of Vn{^) and T>n{l) differ between the 
two-parameter PD prior and the normalized GG prior. 

The second order approximations in Propositions 3 and 4, in combination with corre¬ 
sponding second order refinements of (3), do not lead to a second order refinement of (4). A 
second order refinement of (3), arising from Gnedin et al. (2007), can be expressed as 




a{l 


- cr)i-i 

l\ 


Kn + O 



(5) 


but second order terms in Propositions 3 and 4 are absorbed by O [Knl'^^ in (5). Further¬ 
more, even if a finer version of (5) was available, its combination with Propositions 3 and 4 
would produce higher order terms preventing the resulting expression from being interpreted 
as a Good-Turing estimator and, therefore, any smoothing rule from being elicited. In other 
terms, under the two-parameter PD and the normalized GG priors, the relationship between 
T>n{k) and T)n{l) only holds at the order of asymptotic equivalence. Theorem 2 and Proposi¬ 
tion 4, as to the normalized GG prior, provide useful approximations that might dramatically 
fasten up the evaluation of T)n{l), for I = 0,1,..., n, when n is large, by avoiding the Monte 
Carlo evaluation of the Vn,kJ^ appearing in (5) and (6). 


4 Illustrations 

We illustrate our results with simulations and analysis of data. Data were generated from 
the Zeta distribution, whose power law behavior is common in a variety of applications. See 
Sampson (2001) and references therein for applications of the Zeta distribution in empirical 
linguistics. One has P[Z = z] = z~^/C{s), for z = {1,2,...} and s > 1, where C{s) = 
We took s = 1.1 (case s = 1.5, typically leading to samples with a smaller number 
of distinct values, is presented in the Appendix). We drew 500 samples of size n = 1, 000 
from Z, ordered them according to the number of observed species kn, and split them into 5 
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groups; for i = 1, 2,..., 5, the i-th group of samples was composed of 100 samples featuring 
a total number of observed species kn between the quantiles of order (i — l)/5 and i/5 of 
the empirical distribution of kn- Then we chose at random one sample for each group and 
labeled it with the corresponding index i, leading to five samples (see Table 1). 

We also considered ESTs data generated by sequencing two Naegleria gruheri complemen¬ 
tary DNA libraries; these were prepared from cells grown under different culture conditions, 
aerobic and anaerobic conditions. The rate of gene discovery depends on the degree of 
redundancy of the library from which such sequences are obtained. Correctly estimating 
the relative redundancy of such libraries, as well as other quantities such as the proba¬ 
bility of sampling a new or a rarely observed gene, is of importance since it allows one 
to optimize the use of expensive experimental sampling techniques. The Naegleria gru¬ 
heri aerobic library consists of n = 959 ESTs with kn = 473 distinct genes and m /^959 = 
346, 57,19,12, 9, 5,4, 2,4, 5,4,1,1,1,1,1,1, for / = {1, 2,..., 12} U {16,17,18} U {27} U {55}. 
The Naegleria gruheri anaerobic library consists of n = 969 ESTs with kn = 631 distinct 
genes and mi^gQg = 491, 72, 30,9,13, 5, 3,1, 2, 0,1,0,1, for I G {1, 2,..., 13} (see Table 1). We 
refer to Susko and Roger (2004) for a detailed account on the Naegleria gruheri libraries. 

We focused on the two-parameter PD prior and the normalized GG prior. We choose 
the values of (cr, 9) and {a, r) by an empirical Bayes approach, as those that maximized the 
likelihood function with respect to the sample Xn featuring Kn = kn and (W.n, • ■ •, NK„,n) = 

{'^'1,711 ■ ■ ■ 1 f^kn,ri]l 


(i,. e) = argmax ^ A" «> + 

(a,e) { {9)n 


(it, f) = arg max ■ 
(o'-t) 




- 1 ) 


( 1 ) 

( 2 ) 


As first observed by Favaro et al. (2009), under the specification of the two-parameter PD 
prior and for a relatively large observed sample, there is a high concentration of the posterior 
distribution of the parameter {a, 9) around [a, 9). It can be checked that, under the specifi¬ 
cation of a normalized GG prior, a similar behaviour characterizes the posterior distribution 
of (cr, r). 

Table 1 reports the sample size n, the number of species km and the values of {a, 9) and 
((T,f) obtained by the maximizations (1) and (2), respectively. Here the value of a obtained 
under the two-parameter PD prior coincides, up to a negligible error, with the value of a 
obtained under the normalized GG prior. In general, we expect the same behaviour for any 
Gibbs-type prior in light of the likelihood function of a sample Xn from a Gibbs-type random 
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Table 1: Simulated data and Naegleria gruberi libraries. For each sample we report the sample size 
n, number of species kn and maximum likelihood values (d, d) and (d,f). 



PD 

GG 


sample 

n 

kn 

a 

e 

(J 

T 


1 

1,000 

642 

0.914 

2.086 

0.913 

2.517 


2 

1,000 

650 

0.905 

3.812 

0.905 

4.924 

Simulated data 

3 

1,000 

656 

0.910 

3.236 

0.910 

4.060 


4 

1,000 

663 

0.916 

2.597 

0.916 

3.156 


5 

1,000 

688 

0.920 

3.438 

0.920 

4.225 

Naegleria 

Aerobic 

959 

473 

0.669 

46.241 

0.684 

334.334 

Anaerobic 

969 

631 

0.656 

155.408 

0.656 

4151.075 


probability measure Qh, 

T{n-akn) Jo Jo 

Apart from a, any other parameter is introduced in (3) via the function h, which does not 
depend on the sample size n and the number of species kn- Then, for large n and kn the 
maximization of (3) with respect to a should lead to a value d very close to the value that 
would be obtained by maximizing (3) with h{t) = 1. 

4.1 Credible intervals 

We applied Propositions 1 and 2 in order to provide credible intervals for the Bayesian 
nonparametric estimator 'Dn{l)- For the two-parameter PD prior, for / = 0 we generated 
5,000 draws from the beta n-uk while, for I > 1 we sampled 5,000 draws from the 

distribution of a beta random variable oj^n-(i-a)mi ■ both cases, we computed 

the quantiles of order {0.025,0.975} of the empirical distribution and obtained 95% posterior 
credible intervals for T>n{l)- The procedure for the normalized GG case was only slightly 
more elaborate. By exploiting the adaptive rejection algorithm of Gilks and Wild (1992), we 
sampled 5,000 draws from Zg with density function (6). In turn, we sampled 5,000 draws from 
Wn-ak„,Zg- We then used the quantiles of order {0.025,0.975} of the empirical distribution of 
Wn-ak„,Zg to obtain 95% posterior credible intervals for T’n(O). Similarly, if Z > 1, we sampled 
5, 000 draws from the beta n,n-akn-{i-d)mi „ used the quantiles of the empirical 

distribution of -Wn-akn,Zg) as extremes of the posterior credible 

interval for T)n{l)- Under the two-parameter PD prior and the normalized GG prior, and 
with respect to these data, the top panel of Table 2 shows the estimated /-discoveries, for 
/ = 0,1,5,10, and the corresponding 95% posterior credible intervals. It is apparent that 
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Table 2: Simulated data (top panel) and Naegleria gruberi aerobic and anaerobic libraries (bottom 
panel). We report the true value of the probability !?„(/) (available for simulated data only) and the 
Bayesian nonparametric estimates of Dn{l) with 95% credible intervals for Z = 0,1, 5,10. 



Good"Turing 

PD 

GG 1 

1 

sample 

D„{1) 

©„(/) 

95%-c.i. 

On(Z) 

95%-c.i. 


95% 

-c.i. 


1 

0.599 

0.588 

(0.440, 0.736) 

0.587 

(0.557, 0.618) 

0.588 

(0.558, 

0.620) 


2 

0.592 

0.590 

(0.454, 0.726) 

0.590 

(0.559, 0.621) 

0.591 

(0.562, 

0.620) 

0 

3 

0.600 

0.599 

(0.462, 0.736) 

0.598 

(0.568, 0.628) 

0.599 

(0.567, 

0.630) 


4 

0.605 

0.609 

(0.473, 0.745) 

0.609 

(0.579, 0.638) 

0.608 

(0.577, 

0.638) 


5 

0.599 

0.634 

(0.499, 0.769) 

0.634 

(0.603, 0.664) 

0.635 

(0.604, 

0.663) 


1 

0.050 

0.044 

(0.037, 0.051) 

0.051 

(0.038, 0.065) 

0.051 

(0.038, 

0.065) 


2 

0.052 

0.054 

(0.046, 0.062) 

0.056 

(0.043, 0.071) 

0.055 

(0.042, 

0.070) 

1 

3 

0.051 

0.046 

(0.039, 0.053) 

0.054 

(0.040, 0.068) 

0.053 

(0.040, 

0.068) 


4 

0.055 

0.046 

(0.039, 0.053) 

0.051 

(0.038, 0.065) 

0.051 

(0.038, 

0.065) 


5 

0.061 

0.052 

(0.045, 0.059) 

0.051 

(0.038, 0.065) 

0.050 

(0.038, 

0.064) 


1 

0.015 

0.030 

(0.022, 0.038) 

0.016 

(0.009, 0.025) 

0.016 

(0.009, 

0.025) 


2 

0.022 

0 

(0, 0) 

0.016 

(0.009, 0.025) 

0.016 

(0.009, 

0.025) 

5 

3 

0.019 

0.012 

(0.008, 0.016) 

0.020 

(0.013, 0.030) 

0.021 

(0.012, 

0.030) 


4 

0.015 

0.006 

(0.003, 0.009) 

0.020 

(0.013, 0.030) 

0.021 

(0.013, 

0.031) 


5 

0.007 

0.012 

(0.007, 0.017) 

0.008 

(0.004, 0.015) 

0.008 

(0.003, 

0.015) 


1 

0 

0.011 

n.a. 

0 

(0, 0) 

0 

(0, 

0) 


2 

0.007 

0 

(0, 0) 

0.009 

(0.004, 0.016) 

0.009 

(0.004, 

0.016) 

10 

3 

0.011 

0 

(0, 0) 

0.009 

(0.004, 0.016) 

0.009 

(0.004, 

0.016) 


4 

0.011 

0 

(0, 0) 

0.009 

(0.004, 0.016) 

0.009 

(0.004, 

0.016) 


5 

0 

0.011 

n.a. 

0 

(0, 0) 

0 

(0, 

0) 

n 

Aerobic 

n.a. 

0.361 

(0.293, 0.429) 

0.361 

(0.331, 0.391) 

0.361 

(0.332, 

0.389) 

U 

Anaerobic 

n.a. 

0.507 

(0.451, 0.562) 

0.509 

(0.478, 0.537) 

0.507 

(0.480, 

0.532) 

1 

Aerobic 

n.a. 

0.119 

(0.107, 0.131) 

0.114 

(0.095, 0.134) 

0.110 

(0.092, 

0.131) 

i 

Anaerobic 

n.a. 

0.149 

(0.135, 0.162) 

0.148 

(0.129, 0.169) 

0.150 

(0.131, 

0.172) 

c: 

Aerobic 

n.a. 

0.031 

(0.024, 0.038) 

0.039 

(0.028, 0.052) 

0.039 

(0.028, 

0.053) 

0 

Anaerobic 

n.a. 

0.031 

(0.024, 0.038) 

0.050 

(0.038, 0.064) 

0.050 

(0.038, 

0.064) 

1 n 

Aerobic 

n.a. 

0.046 

(0.037, 0.055) 

0.046 

(0.034, 0.060) 

0.047 

(0.034, 

0.061) 

±u 

Anaerobic 

n.a. 

0.011 

n.a. 

0 

(0, 0) 

0 

(0, 

0) 
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the two-parameter PD prior and the normalized GG prior lead to the same inferences for 
the /-discovery. Such a behaviour is mainly determined by the fact that the two-parameter 
PD prior, for any a G (0,1) and 0 > 0, can be viewed as a mixture of normalized GG 
priors. Specifically, let ^p{cr, 6) and ^g{cr, b) be the distributions of the corresponding random 
probability measures, and let be a Gamma random variable with parameter (0 /ct, 1). 

Then, according to Proposition 21 in Pitman and Yor (1997), £2p{a,9) = and 

specifying a two-parameter PD prior is equivalent to specifying a normalized GG prior with an 
Gamma hyper prior over the parameter . Table 2 allows us to compare the performance 
of the Bayesian nonparametric estimator T>n{l) and the Good-Turing estimator T)n{l). As 
expected, Good-Turing estimates are not reliable as soon as I is not very small compared 
to n. See, e.g., the cases I = 5 and / = 10. Of course these estimates may be improved by 
introducing a suitable smoothing rule for the frequency counts m;^„’s. We are not aware of a 
non-asymptotic approach for devising confidence intervals for D„(/), and found that different 
procedures are used according to the choice of / = 0 and / > 1. We relied on Mao (2004) 
for I = 0 and on Church and Gale (1991) for I > 1. See also Baayen (2001) for details. 
We observe that the confidence intervals for Dn(/) are wider than the corresponding credible 
intervals for Dn(/) when I = 0, and narrower if / > 1. Differently from the credible intervals 
for Vnil), the confidence intervals for Dn(/) are symmetric about Dn(/); such a behaviour is 
determined by the Gaussian approximation used to derive conhdence intervals. 

4.2 Large sample approximations 

We analyzed the accuracy of the large n approximations of D„(/) introduced in Theorem 2, 
Propositions 3 and 4. We first compared the precision of exact and approximated estimators, 
while a second analysis compared the behavior of first and second order approximations for 
varying sample sizes. For the simulated data, the specification of the two-parameter PD 
prior and the normalized GG prior, and for I = 0,1,5,10, we compared the true discovery 
probabilities Dn{l) with the Bayesian nonparametric estimates of Dn{l) and with their corre¬ 
sponding first and second order approximations. From Table 1, the empirical Bayes estimates 
for cj can be slightly different under the two-parameter PD and the normalized GG priors. 
We considered only the first order approximation of Dn(/) with the parameter a = a set as 
indicated in (1). 

Results of this comparative study are reported in Table 3. We also include, as an 
overall measure of the performance of the exact and approximate estimators, the sum of 
squared errors (SSE), defined, for a generic estimator Dn{l) of the /-discovery, as SSE(D„) = 
Ylo<i<n(^riil) — dn{l))‘^, with dn{l) being the true value of Dn{l)- For all the considered 
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Table 3: Simulated data. We report the true value of the probability Dn{l), the Good-Turing estimates 
of Dn{l) and the exact and approximate Bayesian nonparametric estimates of Dn{l). 


1 

Sample 

1 

2 

3 

4 

5 


D„{1) 

0.599 

0.592 

0.600 

0.605 

0.599 



0.588 

0.590 

0.599 

0.609 

0.634 


'Dn{l) under PD 

0.587 

0.590 

0.598 

0.609 

0.634 

0 

under GG 

0.588 

0. 591 

0.599 

0.608 

0.635 


1st ord. 

0.587 

0.588 

0.597 

0.608 

0.633 


2nd ord. PD 

0.589 

0.592 

0.600 

0.610 

0.6366 


2nd ord. GG 

0.589 

0.592 

0.600 

0.610 

0.636 


D„{1) 

0.050 

0.052 

0.051 

0.055 

0.061 


T>n{l) 

0.044 

0.054 

0.046 

0.046 

0.052 


under PD 

0.051 

0.056 

0.054 

0.051 

0.051 

1 

V„{1) under GG 

0.051 

0.055 

0.053 

0.051 

0.050 


1st ord. 

0.051 

0.056 

0.054 

0.051 

0.051 


2nd ord. PD 

0.051 

0.056 

0.054 

0.051 

0.051 


2nd ord. GG 

0.051 

0.056 

0.054 

0.051 

0.0512 


D„{1) 

0.015 

0.022 

0.019 

0.015 

0.007 



0.030 

0 

0.012 

0.006 

0.012 


under PD 

0.016 

0.016 

0.020 

0.020 

0.008 

5 

r)„(0 under GG 

0.016 

0.016 

0.021 

0.021 

0.008 


1st ord. 

0.016 

0.016 

0.020 

0.020 

0.008 


2nd ord. PD 

0.016 

0.016 

0.020 

0.020 

0.008 


2nd ord. GG 

0.016 

0.016 

0.020 

0.020 

0.008 


D„{1) 

0 

0.007 

0.011 

0.011 

0 



0.011 

0 

0 

0 

0.011 


l3„(Z) under PD 

0 

0.009 

0.009 

0.009 

0 

10 

r>„(0 under GG 

0 

0.009 

0.009 

0.009 

0 


1st ord. 

0 

0.009 

0.009 

0.009 

0 


2nd ord. PD 

0 

0.009 

0.009 

0.009 

0 


2nd ord. GG 

0 

0.009 

0.009 

0.009 

0 


10^ X SSE(D„) 

289.266 

275.881 

256.886 

254.416 

255.655 

10' 

‘ X SSE(r)„) under PD 

3.534 

2.057 

1.137 

4.883 

15.437 

10^ X SSE(I)„) under GG 

3.399 

2.080 

1.149 

4.852 

15.045 

lO"* X SSE(r>„) 1st ord. 

3.780 

2.142 

1.180 

4.776 

14.456 

10^ 

X SSE(0„) 2st ord. PD 

3.275 

2.011 

1.128 

5.041 

17.007 

lO"* 

X SSE(D„) 2st ord. GG 

3.279 

2.014 

1.130 

5.035 

16.984 
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samples, there are not substantial differences between the SSEs of the exact Bayesian non- 
parametric estimates and the SSEs of the first and second order approximate Bayesian non- 
parametric estimates. The first order approximation is already pretty accurate and, thus, 
the approximation error does not contribute significantly to increase the SSE. As expected, 
the order of magnitude of the SSE referring to the not-smoothed Good-Turing estimator is 
much larger than the one corresponding to the Bayesian nonparametric estimators. 

We considered simulated data with sample sizes n = 10^, 10^, 10^, 10^. Eor every n, we 
drew ten samples from a Zeta distribution with parameter s = 1.1. We focused on the two- 
parameter PD prior, and for each sample we determined (d, 8) by means of the empirical 
Bayes procedure described in (1). We then evaluated, for every I = 0,1,..., n -|- 1, the 
exact estimator 'Dn(l) as well as its first and second order approximations. To compare the 
relative accuracy of the first and second order approximations 'Dn\l) and T)n\l) of the same 
estimator Pn(0 introduce the ratio ri^ 2 ,n of the sum of squared errors — 

T’n(O)^ for i = 1 over i = 2. We computed the coefficient ri^ 2 ,n for all the samples and, for each 
n, the average ratio ri, 2 ,n- We found the increasing values ri^ 2 ,n = 0.163,0.493,1.082,2.239 
for sizes n = 10^,10^,10^,10® (see Eigure SI in the Appendix). While for small n a first 
order approximation turns out to be more accurate, for large values of n (n > 10^ in our 
illustration), as expected, the second order approximation is more precise. 


A Appendix 


This appendix contains: i) the proofs of Theorem 1, Proposition 1, Proposition 2, Theorem 
2, Proposition 3 and Proposition 4; ii) details on the derivation of the asymptotic equivalence 
between Vn{l) and =>^pd); hi) additional application results. 

Let Xn = {Xi ,..., Xn) be a sample from a Gibbs-type RPM Q^. Recall that, due to the 
discreteness of Qh, the sample Xn features Kn = kn species, labelled by X^,... with 

corresponding frequencies {Ni^n, ■ ■ ■, XK„,n) = ■ j ?T-fc„,n)- Eurthermore, let = 

mi^n be the number of species with frequency I, namely Mi^n = „=z} such that 

^i,n = Kn and Yji<i<n = n. For any a € (0,1) let fa be the density function 
of a positive u-stable random variable. According to Proposition 13 in Pitman (2003), as 
n —>■ -|-oo 


Kn a.s.' « 
n" 


(AO.l) 


and 


Ml^n a.s. U(1 - a)l-i 






S. 


a,h^ 


(A0.2) 


where is a random variable with density function^ (s) = cj ^h{s ^^^)fa(s ^Z”"). 
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Note that by the fluctuation limits displayed in (AO.l) and (AO.2), as n tends to infinity the 
number of species with frequency I in a sample of size n from Qh becomes, almost surely, 
a proportion fT(l — of the total number of species in the sample. All the random 

variables introduced in this Appendix are meant to be assigned on a common probability 
space (O, P). 

A1 Proofs 

Proof of Theorem 1. We proceed by induction. Note that the result holds for r = 1, and 
obviously for any sample size n > 1. Let us assume that it holds for a given r > 1, and also 
for any sample size n > 1. Then, the (r + l)-th moment of Qh{A) \ can be written as 
follows 


nQl{A)\Xn] 

— / I £ A I Xji, — Xn+1, . . . , Xn-i^r — ^^n+r] 

Ja Ja 

X G dXn+r I X-n, = Xn+1, . . . , Xn-i^-r—l = X^+r—l] 

X • • • X P[An,+2 £ dXn+2 I Xji, Xn-^-l = Xn-(-l]P[A}^-|_i G dx^+l | Xn\ 

= [ E[Q^(A)|X„,A,,+i = x„+i] 

Ja 

( + / 1 \ I \ ^ \ r /- i \\ 

-i^o(dx„+i) + — - - (T)Ox*[dXn+l) . 




h,{n,kn) 


V^ 


h,{n,k„) 


Further, by the assumption on the r-th moment and by dividing A into (A \ Xn) U (A n Xn), 
one obtains 


E[Q 5 ;+^(A)|X„] 

^ T/ 

-^7- H {A)\ 

i=0 ^h,{n,kn) 


^r,i(^l-^n,kn(,A') + 1 ( t ) 


r +1 

+ E 

2 = 1 


^^+T + l,fcn+2’ + l —2 I 






h,{n,kn 


where we defined Rr,i{fJ.) := 'Eo<jl<■■■<ji<r-iU.l<l<i{^J- + a(1 - cj) + / - 1). The proof is 
completed by noting that, by means of simple algebraic manipulations, Rr+i^i{fJ,) = Rr,i{lJ- + 
1 — a)+fJ-Rr,i-i{fJ-+l)■ Note that when i^o(A) = 0 and i = r, the convention r'o(A)'’“* = 0*^ = 1 
is adopted. □ 

Proof of Proposition 1. Let us consider the Borel sets Aq := X \ {X^,..., 
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Ai := {X* : Ni^n = for any I = 1,... ,n. The two parameter PD prior is a Gibbs-type 
prior with h{t) = p{t;a,9) := aT{6)t~^/T{6/a), for any a G (0,1) and 9 > —a. Therefore 
one has Vn,k„ = y^p,{n,kn) = no<*<A:„-i(^ + fo)- % a direct application of Theorem 1 

we can write 

EIQU^o) I ^“1 = E ■ ''*'"** 

_ /n\ + 0'kn)r 

>^{9U9 + n)r 
{9 (J A^ri)r 

{9 + akn + n- akn)r ’ 

which is r-th moment of a Beta random variable with parameter {9 + ak,n — ak). Let us 
define the random variable Y = ZpR^^ Zp- Then, it can be easily verified that Y has density 
function 


fviy) = [ -fR„,Ay/z)fZp{z)dz 

Jo z: 


a 


r( 0 / a + kn) Jq 




a 


r{9/a + kn) 


ye+(Tk„-l^-y 




where, by Equation 60 in Pitman (2003), fa{u)du = T{9/a + kn)/crT{9 + akn). 

Hence T is a Gamma random variable with parameter {9 + akn, !)• Accordingly, we have 
Wn-akn,Zp = Be+akn,n-( 7 k„- Similarly, by a direct application of Theorem 1, for any Z > 1 we 
can write 


[a)n-\-r 

_ _ ((/ - a)mi^n)r _ 

{{I - a)mi^n)r + 9 + n- {I- a)mi^n ’ 

which is the r-th moment of a Beta random variable with parameter ({I — a)mi^n, 9 + n — {I — 
a)mi^n). Finally, the decomposition - 

Wn-ak„,Zp) follows from a characterization of Beta random variables in Theorem 1 in Jam- 
bunathan (1954). It can be also easily verihed by using the moments of Beta random variables. 
□ 


Proof of Proposition 2. Let us consider the Borel sets Aq := X \ {X^, ..., X^^} and 
Ai := {X* : Ni^n = 0) fo^ I = ^, ■ ■ ■ ,n. The two parameter PD prior is a Gibbs-type prior 
with h{t) = g{t', a, r) := expjr®^ — rt}, for any r > 0. By a direct application of Theorem 1 
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we can write 




(Al.l) 


aT{n) 


C*(T,T,n,fc„r(?T. Crkn) Jq 


'•I /* + oo 

Jo 


where 


/*+oo rl 


cjr(n) 


frj{wt)dwdt 


10 


n—1 

E 

i=0 


n — 1 


(-t)T(/c - ila\T^). 


Hereafter we show that (Al.l) coincides with the r-th moment of the random variable 
Wn-akr^,Zg- Given Zg = z\i is easy to find that the distribution of Wn-akn^z has the following 
density function 




expl^;”"} 
zT{n — kno) 


(1 - w)'^-kr.<z-i 



du. 


By randomizing over z with respect to the distribution of Zg provides the distribution of 
Wn-akn,Zg- Specifically, 


i'w) = 


a 


G^,.r,n,A:„r(n - 0-A:„) 


(1-u;) 


n—akn — 1 


/ oo 

Z-'^+^k^-^Z - tY 


-1 I 


dudz 


a 


GCT,r,n,A:„ 1^(71 O'k') 


( 1 -u;) 


n—crkn — 1 


) 1*00 

(z-r)”-i / r-"^’^e-‘^^(7/;t)dtdz 

Jo 


aT{n) 


Ccr,T,n,kn^ <7^n) 


(l-w) 


n—crkn — l I ^—crkn^—Tt 


^-akr^^-Ttf^ (Wt) dt. 


Therefore, 


E[K- 


crkntZg} 


aT{n) 


G(7^T-,n,fc„b(7l Crkn) Jq 


[ w‘^{l — w) 

Jo 


n—akn — 1 


^—akn^—rt 


fa (wt) dfdtc 


which coincides with (Al.l). We complete the proof by determining the distribution of the 
random variable Qg{Ai) \ Xn, for any / > 1. Again, by a direct application of Theorem 1 we 
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can write 


E[QliAi) I 

= {{I - a)mi^ri)', 


r{n-ak„+r) t exp{-Tt} Jq {1 - ^ f^{zt)dtdz 

nnZkr.) /o^“ exp{-rt} (1 - /^(zt)dtdz 

r(n - akn) 


dx 


r ((/ - (T)mi^n) r(El<i^i<n El<i^Z<n 

Jo 

f+co exp{-rt} /o (1 - z)"+’'-i-'"^"/^(2t)dtdz 
/o^°° exp{—rt} /^(l — z)’^“^“‘^^"/o-(2:t)dtd2: 

_r(n - akn) _ 

r {{I - a)mi^n) r(El<i^Z<n i^i,n “ El<i^Z<n ^i,n) 

Jo 

T§Sh) /o^“ exp{-Tt} /; x'-(l - z)^(l - z)"-i-""^(zt)dtdz 


r(n—crfc„) 


/o^°° exp{—rt} Jq^(1 — 2 ;)"’“^“°'*^"/(j(^;t)dtdz 


dx, 


which is the r-th moment of the scale mixture - W'„-aA:„,Zj), 

where W,i-(Tfc„,Zg is the random variable characterized above, and where the Beta random 
variable is independent of the random variable (1 - Wn-akr„Zg)- 

The proof is completed. □ 

Proof of Theorem 2. According to the fluctuation limit (AO.l) there exists a non¬ 
negative and finite random variable S^j^h such that n~^Kn as n —)• -l-oo. Let 

Oq := {w G n : lim„^+oo ?^~'^A'„(tz;) = 5o-,Zi(a;)}. Furthermore, let us define go,h{n,kn) = 
^/x,(n+i,fc„-ri)/V'h,(n,Zc„), where V'zi,(n,Zc„) = (y^"~^E{kn)E[h{S^^k„/B^k„,n-akJ\/E{n). Then we 
can write the following expression 


90,h kn) 


akn ® 

K 

3(7 ,^77, + ! 

El 

^akn + l,n+l — o-(kn + l) J 

n 

E 

hIb ) 

\ ^(Tkn,n-<7kn / 



(A1.2) 


We have to show that the ratio of the expectations in (A1.2) converges to 1 as n —?• + 00 . For 
this, it is sufficient to show that, as n —)■ + 00 , the random variable T( 7 ,n,Zc„ = Scr^k„/Bak„,n-akn 
converges almost surely to a random variable Tq-^zi- This is shown by computing the moment 
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of order r of , i-e. 


E(r; 


n,kn 


r(n) r{kn-r/a) ^ rf 
r(n - r) f{K) “ 


For any a; G rio the ratio njKn^{ lo) = njkn^^ converges to S^j/^{io) = Tfj^h{u) = t. 
Accordingly, rf jkn^ converges to E[T^{uj)] = f for any w G Oq- Since P[no] = 1) the 
almost sure limit, as n tends to infinity, of the random variable Tf^^n,K„ is identified with the 
nonnegative random variable which has density function ^{t) = h{t)fa{t). The proof 
is completed. 


Proof of Proposition 3. Let h{t) = p{t-,a,6) := aT{9)t~^/T{9/a), for any a G (0,1) 
and 9 > -a. Furthermore, let us define go,p{n,kn) = Pp,(n+i,A:„+i)/^p,(n,fc„) and gi,p{n,kn) = 
1 - Pp,(n+i,fc„+i)/'l^p,(n,fc„), so that we have go{n,kn) = {9 + akn)l{9 + n) and gi{n,kn) = 
l/{9 + n). Then, 


90,pij^i kn) — 


akr, 


n 


n 


n 


and 


1 


9 


gi,p{n,kn) = - 2 + ® ^ 


1 


H-1“ o ( — I (A1.3) 

(A1.4) 

n \n^ ' 

follow by a direct application of the Taylor series expansion to go{n,kn) and gi{n,kn), re¬ 
spectively, and then truncating the series at the second order. The proof is completed by 
combining (A1.3) and (Al.4) with the Bayesian nonparametric estimator T’n(0 under a two 
parameter PD prior. □ 


Proof of Proposition 4. The proof is along lines similar to the proof of Proposition 
3.2. in Ruggiero et ah (2015), which, however, considers a different parameterization for the 
normalized GG prior. Let h{t) = g{t\ a, r) := expjr'^ — rf}, for any a G (0,1) and r > 0, and 
let gOjgi'IT'j kn) ^g,(n+l,kn+l)/^g,{n,kn) Uud gi^pijl^kn) 1 '^g,(n+l,kn+l)/'^g,{n,kn)i where We 


have 




r(n) ,0 

Note that, by using the triangular relation characterizing the nonnegative weight 


we can write 


50,5(u, kfi) 


K 


g,{n,k„) 


(n (^kn)Vg^(^n+l,k„) ^ 

yg,in,kn) 



w{n, kn), 


where 


w{n,kn) 


x” exp{—[(r-|-a:)°'— r°']}(r-|-^ dx 
x^~^ exp{—[(r -|- x)®" — t^W{t + dx 


23 








Let us denote by f{x) the integrand function of the denominator of 1 — w{n,kn), and let 
fN{x) = r/(x)/(r + x). That is, fN{x) is the denominator of 1 — w{n, kn)- Therefore we can 


write 


1 - w{n,kn) 


/o°° Tf{x)/{T + x)dx 

ir da; 


Since f{x) is unimodal, by means of the Laplace approximation method it can be approxi¬ 
mated with a Gaussian kernel with mean x* = argmax^j^Q exp{ —[(r + xY — r'^jKr + 
^'jcrkn-n with variance — [(logo/)"(x*)]“^. The same holds for /Ar(x). Then, we obtain 
the approximation 


fNixlf)C{x*j^,-[{logofNy{xN)] 

^ fix*j,)C{x}j,-[{logofnx},)]-Y ’ 


where and x*j^ denote the modes of fjy and /, respectively, and where C{x,y) denotes 
the normalizing constant of a Gaussian kernel with mean x and variance y. Specifically, this 
yields to 

fNix^) ( (logo/7v)"(x^) 


1 - w{n,kn) 


- 1/2 


(A1.5) 


fixh) V (iogo/)"(x;^) 

The mode is the only positive real root of the function G{x) = (Tx(r+x)'^ —(n—l)r—— 
l)x. A study of G shows that x^ is bounded by below by a positive constant times 
which implies that the terms involving r are negligible in the following renormalization of 
G{x*j,) 


X 


^ I _ 


n 


D 


n 


n 


re — 1 

(7+1 


akr, — lx 


r — 


D 


n 


re" 


re 


The same calculation holds for x^. According to the fluctuation limit (AO.l) there exists 
a nonnegative and finite random variable such that n~^Kn S^^g as re —>■ +oo. Let 
flo ^ : lim„_,.+oo n~^Kn{w) = 5ct,/i(w)}, and let Sa,g{uj) = for any ui G flo- Then, 

we have 

rf* rp*_ 

(A1.6) 


"AT ^ TT ~ gi/o 


ry'T' rf'’- 

re re 

In order to make use of (Al.5), we also need an asymptotic equivalence for x^ — xj^. Note 
that G(x^) = 0 and G{x*^) = —x*^ allow us to resort to a first order Taylor bound on G 


at xT and shows that xT — xT has a lower bound equivalent to s 


(1 ai „2 The same 


re 


7^2 


argument applied to G(x) 
equivalence, thus 


X at x*j^ provides an upper bound with the same asymptotic 

(AI.7) 

rei (7 

By studying / and /at, as well as the second derivative of their logarithm, together with 
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asymptotic equivalences (A1.6) and (Al.7), we can write /(xj^) ~ /(x^) and (logo/)"(xJ)) ~ 
(logo/)"(x^) ~ (log o/ 7 v)''(x^). Hence, from (Al.5) one obtains l — w{n,kn) — r/(r + xj^) ~ 
jn, which leads to 


go,g{n,kn) = 1 - ^1 - (l- +o(- 


crkn _i/^ 1 f I 

-h - + O - 

n n \n 


and 


1 -go.g(ra,fcn) ^ 1 + 

n — akn 


g\,g{n,kn) = 


n \ n 


-l/o 


n 


+ o( - 
n 


_ crk 


(A1.8) 


(A1.9) 


Expressions (A1.8) and (Al.9) provide second order approximations of go,g{n, kn) and gi,g(n, kn), 
respectively. Recall that for any oj in Oq we have n~^kn ~ s^, namely we can replace with 
n~^kn- This is because of the fluctuation limit displayed in (AO.l). The proof is completed 
by combining (Al.8) and (Al.9) with the Bayesian nonparametric estimator T)n{l) under a 
normalized GG prior. □ 


A2 Details on the derivation of T>n{l) ~ 


Let us define Ca,i = cr(l - a)i-i/l\ and recall that Vn{0) = 14+i,A:„+i/K,fc„ and Vn{l) = 
{I — cr)mi^nVn+i,k„/^n,kn- The relationship between the Bayesian nonparametric estimator 
'Dn{l) and the smoothed Good-Turing estimator 'Dn{k, ^pb) follows by combining Theorem 
2 with the fluctuation limits (AO.l) and (AO.2). For any w G H, a version of the predictive 
distributions of Qa,h is 






Vr 


n,Kn{u)) 


M-) + 


K 




n+l,Kn{ui) 


'^n,Kn{u}) 




According to (AO.l) and (AO.2), lim„_>+oo = 1 almost surely. See Lemma 3.11 

in Pitman (2006) for additional details. By Theorem 2 we have Vn+i,K„+i/yn,Kn — o'Kn/n, 
and Mi^n — o’Kn, as n —)• -|-oo. Then, a version of the Bayesian nonparametric estimator of 
the 0-discovery coincides with 


yn+l,Kr,[uj)+l ^ CrKnjuj) 

^n,Kn{ijj) ^ 


(A2.1) 
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n 


as n —)• + 00 . By Theorem 2 we have Vn+i,Kn/^n,Kn — 1/^) ~ Ca^iKn, as n —)> +oo. 

Accordingly, a version of the Bayesian nonparametric estimator of the Z-discovery coincides 
with 


{I - a)Mi^n{t^) 


'^n+l,K„{Lj) 

^n,K„(uj) 


C::! {I — a) 


^ Ca,l{^ 


^ (Z + 1) 


n 

- a) 

n 

n 


(A2.2) 


as n + 00 . Let Oq := {^^ G : lim„^+oo n ""Kn{w) = lim„^+oo = 

Ca,iZafi/a{^)}- From (AO.l) and (AO.2) we have P[f2o] = 1. Fix a; G Oq and denote by 
kn = Kn{oj) and the number of species generated and the number of species 

with frequency Z generated by the sample Xn{oj)- Accordingly, Vn{l) — Vn{k, follows 

from (A2.1) and (A2.2). 


A3 Additional illustrations 

In this Section we provide additional illustrations accompanying those of Section 4 in the main 
manuscript. Specifically, we consider a Zeta distribution with parameter s = 1.5. We draw 
500 samples of size n = 1000 from such distribution, we order them according to the number 
of observed species kn, and we split them in 5 groups: for i = 1, 2,..., 5, the Z-th group of 
samples will be composed by 100 samples featuring a total number of observed species kn 
that stays between the quantiles of order (i — l)/5 and Z/5 of the empirical distribution of 
kn- Then we pick at random one sample for each group and label it with the corresponding 
index i. This procedure leads to five samples. As shown in Table SI, the choice of s = 1.5 
leads to samples with a smaller number of distinct values if compared with the case s = 1.1 
(see also Table 1 in the main manuscript). Table S2, under the two parameter PD prior 
and the normalized GG prior, shows the estimated Z-discoveries, for Z = 0,1,5,10, and the 
corresponding 95% posterior credible intervals. Finally, Figure SI shows how the average 
ratio ri, 2 ,n evolves as the sample size increases (see Section 4.2 in the main manuscript). 
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Table SI; Simulated data with s = 1.5. For each sample we report the sample size n, the 
number of species kn and the maximum likelihood values {a, 9) and {a, f). 



PD 

GG 

sample 

n 


a 

e 

a 

f 

1 

1000 

128 

0.624 

1.207 

0.622 

3.106 

2 

1000 

135 

0.675 

0.565 

0.673 

0.957 

Simulated data 3 

1000 

138 

0.684 

0.487 

0.682 

0.795 

4 

1000 

146 

0.656 

1.072 

0.655 

2.302 

5 

1000 

149 

0.706 

0.377 

0.704 

0.592 


Table S2: Simulated data with s = 1.5. We report the true value of the probability Dn{l) 
and the Bayesian nonparametric estimates of Dn{l) with 95% credible intervals. 



Good-Turing 

PD 

GG 

1 

sample 

D„(0 

T„(0 

95%-c.i. 

D„(0 

95%-c.i. 

T„(0 

95%-c.i. 


1 

0.099 

0.080 

(0.010, 0.150) 

0.081 

(0.065, 0.098) 

0.081 

(0.065, 0.098) 


2 

0.103 

0.092 

(0.012, 0.172) 

0.092 

(0.075, 0.110) 

0.091 

(0.075, 0.110) 

0 

3 

0.095 

0.096 

(0.014, 0.178) 

0.095 

(0.078, 0.114) 

0.095 

(0.076, 0.113) 


4 

0.096 

0.096 

(0.015, 0.177) 

0.097 

(0.079, 0.116) 

0.097 

(0.080, 0.115) 


5 

0.093 

0.108 

(0.019, 0.197) 

0.106 

(0.087, 0.126) 

0.105 

(0.087, 0.124) 


1 

0.030 

0.038 

(0.031, 0.045 ) 

0.030 

(0.020, 0.042) 

0.030 

(0.021, 0.042) 


2 

0.037 

0.030 

(0.024, 0.036) 

0.030 

(0.021, 0.041) 

0.030 

(0.020, 0.042) 

1 

3 

0.034 

0.034 

(0.028, 0.040) 

0.030 

(0.021, 0.042) 

0.031 

(0.021, 0.042) 


4 

0.029 

0.040 

(0.033, 0.047) 

0.033 

(0.023, 0.045) 

0.033 

(0.022, 0.044) 


5 

0.040 

0.026 

(0.021, 0.031) 

0.032 

(0.022, 0.044) 

0.032 

(0.023, 0.043) 


1 

0.013 

0.012 

(0.008, 0.016) 

0.013 

(0.007, 0.021) 

0.013 

(0.007, 0.021) 


2 

0.011 

0.006 

(0.003, 0.009) 

0.004 

(0.001, 0.009) 

0.004 

(0.001, 0.009) 

5 

3 

0.010 

0.012 

(0.007, 0.017) 

0.009 

(0.004, 0.015) 

0.009 

(0.004, 0.016) 


4 

0.010 

0.036 

(0.024, 0.048) 

0.009 

(0.004, 0.015) 

0.009 

(0.004, 0.015) 


5 

0.012 

0 

(0, 0) 

0.013 

(0.007, 0.021) 

0.013 

(0.006, 0.021) 


1 

0.019 

0 

(0, 0) 

0.019 

(0.011, 0.028) 

0.019 

(0.011, 0.028) 


2 

0 

0.011 

n.a. 

0 

(0, 0) 

0 

(0,0) 

10 

3 

0.011 

0.011 

(0.006, 0.016) 

0.009 

(0.004, 0.016) 

0.009 

(0.004, 0.016) 


4 

0 

0 

n.a. 

0 

(0,0) 

0 

(0,0) 


5 

0.006 

0 

(0, 0) 

0.009 

(0.004, 0.016) 

0.009 

(0.004, 0.017) 


27 





2.5 



11 


Figure SI: Average ratio ri, 2 ,n of sums of squared approximation errors for different sample 
sizes n = 10^, 10^, 10^, 10®. For the x-axis a logarithmic scale was used. 
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